1-PAGER: One Pass Answer Generation and Evidence Retrieval
We present 1-Pager, the first system that answers a question and retrieves
evidence using a single Transformer-based model and decoding process. 1-Pager
incrementally partitions the retrieval corpus using constrained decoding to
select a document and answer string, and we show that this is competitive with
comparable retrieve-and-read alternatives according to both retrieval and
answer accuracy metrics. 1-Pager also outperforms the equivalent closed-book
question answering model by grounding predictions in an evidence corpus. While
1-Pager is not yet on par with more expensive systems that read many more
documents before generating an answer, we argue that it provides an important
step toward attributed generation by folding retrieval into the
sequence-to-sequence paradigm that is currently dominant in NLP. We also show
that the search paths used to partition the corpus are easy to read and
understand, paving the way toward interpretable neural retrieval.
Comment: Accepted at EMNLP 2023 (Findings)
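To make the constrained-decoding idea concrete, here is a minimal sketch (not the authors' implementation): a prefix trie over corpus token sequences masks each decoding step to tokens that still lead to a real document-and-answer path, and `step_logits_fn` is a hypothetical stand-in for the underlying sequence-to-sequence model.

```python
# Minimal sketch of trie-constrained decoding in the spirit of 1-Pager's
# corpus partitioning. All names are illustrative, not the authors' code.
from collections import defaultdict

class Trie:
    """Prefix trie over token-ID sequences (e.g., document-title + answer paths)."""
    def __init__(self):
        self.children = defaultdict(Trie)

    def insert(self, token_ids):
        node = self
        for t in token_ids:
            node = node.children[t]

    def allowed_next(self, prefix):
        """Token IDs that extend `prefix` toward some valid corpus path."""
        node = self
        for t in prefix:
            if t not in node.children:
                return set()
            node = node.children[t]
        return set(node.children)

def constrained_greedy_decode(step_logits_fn, trie, max_len=32):
    """Greedy decoding masked at every step to valid trie continuations.

    `step_logits_fn(prefix)` is assumed to return {token_id: logit} from
    the underlying seq2seq model.
    """
    prefix = []
    for _ in range(max_len):
        allowed = trie.allowed_next(prefix)
        if not allowed:  # reached a leaf: a fully specified (document, answer) path
            break
        logits = step_logits_fn(prefix)
        # Pick the best-scoring token among valid continuations only.
        prefix.append(max(allowed, key=lambda t: logits.get(t, float("-inf"))))
    return prefix
```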
NAIL: Lexical Retrieval Indices with Efficient Non-Autoregressive Decoders
Neural document rerankers are extremely effective in terms of accuracy.
However, the best models require dedicated hardware for serving, which is
costly and often not feasible. To avoid this serving-time requirement, we
present a method of capturing up to 86% of the gains of a Transformer
cross-attention model with a lexicalized scoring function that only requires
10⁻⁶% of the Transformer's FLOPs per document and can be served using commodity
CPUs. When combined with a BM25 retriever, this approach matches the quality of
a state-of-the-art dual-encoder retriever that still requires an accelerator
for query encoding. We introduce NAIL (Non-Autoregressive Indexing with
Language models) as a model architecture that is compatible with recent
encoder-decoder and decoder-only large language models, such as T5, GPT-3 and
PaLM. This model architecture can leverage existing pre-trained checkpoints and
can be fine-tuned for efficiently constructing document representations that do
not require neural processing of queries.
Comment: To appear at EMNLP 2023
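To illustrate why this kind of index is servable on CPUs, here is a hedged sketch of the general pattern: each document is precomputed offline into a sparse token-to-weight map, so scoring a query reduces to a handful of dictionary lookups and additions. The weights below are toy values, not NAIL's learned scoring function.

```python
# Illustrative lexicalized scoring: queries never touch a neural network
# at serving time; documents were indexed offline by a (hypothetical) model.

def score(query_tokens, doc_weights):
    """Sparse dot product: sum the document's weights for the query's tokens."""
    return sum(doc_weights.get(tok, 0.0) for tok in query_tokens)

# Offline-built index mapping each document to sparse token weights (toy values).
index = {
    "doc1": {"neural": 1.3, "reranker": 2.1, "accuracy": 0.7},
    "doc2": {"bm25": 1.8, "retriever": 1.5, "accuracy": 0.4},
}

query = ["neural", "reranker"]
ranked = sorted(index, key=lambda d: score(query, index[d]), reverse=True)
print(ranked)  # ['doc1', 'doc2']
```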
MICK: A Meta-Learning Framework for Few-shot Relation Classification with Small Training Data
Few-shot relation classification seeks to classify incoming query instances
after seeing only a few support instances. This ability is gained by training
with a large amount of in-domain annotated data. In this paper, we tackle an even
harder problem by further limiting the amount of data available at training
time. We propose a few-shot learning framework for relation classification,
which is particularly powerful when the training data is very small. In this
framework, models not only strive to classify query instances, but also seek
underlying knowledge about the support instances to obtain better instance
representations. The framework also includes a method for aggregating
cross-domain knowledge into models by open-source task enrichment.
Additionally, we construct a new dataset, TinyRel-CM: a few-shot relation
classification dataset in the health domain with purposely small
training data and challenging relation classes. Experimental results
demonstrate that our framework brings performance gains for most underlying
classification models, outperforms the state-of-the-art results given small
training data, and achieves competitive results with sufficiently large
training data.
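For intuition, the following is a minimal prototype-style few-shot classifier of the kind such frameworks build on: class prototypes are averaged support embeddings, and a query is assigned to the nearest prototype. The encoder here is a stand-in (random vectors); MICK's actual contributions (support-set knowledge, cross-domain enrichment) layer on top of a base learner like this.

```python
# Minimal prototype-based few-shot relation classifier (illustrative only).
import numpy as np

def prototypes(support_embs, support_labels):
    """Mean embedding per relation class, computed from the few support instances."""
    labels = np.array(support_labels)
    return {c: support_embs[labels == c].mean(axis=0) for c in set(support_labels)}

def classify(query_emb, protos):
    """Assign the query to its nearest class prototype (Euclidean distance)."""
    return min(protos, key=lambda c: np.linalg.norm(query_emb - protos[c]))

# Toy episode with random "embeddings" standing in for an encoder.
rng = np.random.default_rng(0)
support = rng.normal(size=(6, 8))
labels = ["born_in", "born_in", "works_for", "works_for", "part_of", "part_of"]
print(classify(rng.normal(size=8), prototypes(support, labels)))
```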
Calibrating Likelihoods towards Consistency in Summarization Models
Despite the recent advances in abstractive text summarization, current
summarization models still suffer from generating factually inconsistent
summaries, reducing their utility for real-world applications. We argue that the
main reason for such behavior is that the summarization models trained with
a maximum likelihood objective assign high probability to plausible sequences
given the context, but they often do not accurately rank sequences by their
consistency. In this work, we solve this problem by calibrating the likelihood
of model generated sequences to better align with a consistency metric measured
by natural language inference (NLI) models. Human evaluation and
automatic metrics show that the calibrated models generate more consistent and
higher-quality summaries. We also show that the models trained using our method
return probabilities that are better aligned with the NLI scores, which
significantly increases the reliability of summarization models.
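One plausible way to instantiate such a calibration objective (not necessarily the paper's exact loss) is a pairwise ranking loss that pushes the model's sequence log-likelihoods to agree with the ordering induced by NLI consistency scores:

```python
# Hedged sketch: calibrate sequence likelihoods against NLI consistency via a
# pairwise margin loss. This is one plausible formulation, not the paper's own.
import torch

def calibration_loss(log_likelihoods, nli_scores, margin=0.1):
    """If candidate i is more NLI-consistent than candidate j, its sequence
    log-likelihood should exceed j's by at least `margin`.

    log_likelihoods: (n,) model log p(candidate | source), with gradients
    nli_scores:      (n,) entailment scores from an NLI model (no gradients)
    """
    loss, pairs = log_likelihoods.new_zeros(()), 0
    n = len(nli_scores)
    for i in range(n):
        for j in range(n):
            if nli_scores[i] > nli_scores[j]:
                loss = loss + torch.relu(margin - (log_likelihoods[i] - log_likelihoods[j]))
                pairs += 1
    return loss / max(pairs, 1)

lls = torch.tensor([-4.2, -3.1, -5.0], requires_grad=True)  # candidate log-likelihoods
nli = torch.tensor([0.9, 0.3, 0.6])                          # NLI consistency scores
calibration_loss(lls, nli).backward()  # gradients push likelihoods toward NLI order
```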
New Protocols and Negative Results for Textual Entailment Data Collection
Natural language inference (NLI) data has proven useful in benchmarking and,
especially, as pretraining data for tasks requiring language understanding.
However, the crowdsourcing protocol that was used to collect this data has
known issues and was not explicitly optimized for either of these purposes, so
it is likely far from ideal. We propose four alternative protocols, each aimed
at improving either the ease with which annotators can produce sound training
examples or the quality and diversity of those examples. Using these
alternatives and a fifth baseline protocol, we collect and compare five new
8.5k-example training sets. In evaluations focused on transfer learning
applications, our results are solidly negative, with models trained on our
baseline dataset yielding good transfer performance to downstream tasks, but
none of our four new methods (nor the recent ANLI) showing any improvements
over that baseline. In a small silver lining, we observe that all four new
protocols, especially those where annotators edit pre-filled text boxes, reduce
previously observed issues with annotation artifacts.
Comment: To appear at EMNLP 2020